Abstract


Many practical techniques for probabilistic
inference require a sequence of distributions
that interpolate between a tractable distribution
and an intractable distribution of interest.
Usually, the sequences used are simple, e.g.,
based on geometric averages between distributions.
When models are expressed as probabilistic
programs, the models themselves are
highly structured objects that can be used
to derive annealing sequences that are more
sensitive to domain structure. We propose an
algorithm for transforming probabilistic programs
to coarse-to-fine programs which have
the same marginal distribution as the original
programs, but generate the data at increasing
levels of detail, from coarse to fine. We
apply this algorithm to an Ising model, its
depth-from-disparity variation, and a factorial
hidden Markov model. We show preliminary
evidence that the use of coarse-to-fine models
can make existing generic inference algorithms
more efficient.


1 INTRODUCTION 

Imagine watching a tennis tournament. Your visual 
system makes fast and accurate inferences about the 
depth-field (how far away are different patches?), the 
objects (is that a ball or racket?), their trajectories, 
and many other properties of the scene. A powerful 
intuition is that such feats of inference are enabled by 
coarse-to-fine reasoning: first getting a rough sense of 
where the action is in the scene, about how far away 
it is, and so on; later refining this impression to pick 
out details. The appeal of coarse-to-fine reasoning is 
manifold. First, there is introspection: When faced 
with a complex reasoning task, it often helps to take 
a step back, try to understand the big picture, and 
then focus on what seems most promising. The big 


Figure 1: Incremental coarsening reduces surprise
in SMC. Particles (red) are directed towards high-probability
regions (light) step by step as we refine
the state space from coarse to fine. The numbers indicate
how many particles are associated with a particular
state.


picture tends to have fewer moving parts, and its parts
tend to be easier to understand. Neuroscience provides
another angle: for instance, face processing in
the high-level visual cortex plausibly follows coarse-to-fine
principles (Goffaux et al., 2010), and stereoscopic
depth perception similarly proceeds from large to small
spatial scales (Menz and Freeman 2003). Finally, there
is a rich set of existing coarse-to-fine
techniques for specific applications in a diverse set of
areas including physical chemistry (Lyman and Zuckerman
2006), speech processing (Tang et al. 2006),
PCFG parsing (Charniak et al., 2006), and machine
translation (Petrov et al., 2008). Despite the success
and appeal of coarse-to-fine ideas, they have been difficult
to apply in general settings. Here we propose a
system for deriving coarse-to-fine inference from any
model written as a probabilistic program. We do this
by leveraging program structure to transform the initial
program into a multi-level coarse-to-fine program that
can be used with existing inference algorithms.

Probabilistic programming languages provide a universal
and high-level representation for probabilistic
models, separating the burdens of modeling from those
of inference. Yet the difficulty of inference can grow
quickly as the state space (number of program executions)
grows large. A widely-used technique for inference
in large state spaces is Sequential Monte Carlo
(SMC), a class of algorithms based on constructing a
sequence of distributions, beginning with an easy-to-sample
distribution and ending with the distribution
of interest, with each distribution serving as an importance
sampler for the next. The success of SMC
rests on the quality of the approximating sequence.
We present a generic method for deriving coarse-to-fine
sequences of approximating distributions from a
probabilistic program.


Our approach can be seen as building a hierarchical
model from an initial model, where each stage of the
hierarchy resolves more details of the state space than
the one before. We additionally augment each level of
the hierarchy with a coarse approximation to the evidence
(implemented via heuristic factors, see Section 3.1),
in order to specify a useful conditional distribution
at each level. In practice we create the hierarchical
model and heuristic factors at once by specifying how to
“lift” each element of the program (elementary distributions,
factors, primitive functions, and constants) to
the coarser levels. The resulting model supports coarse-to-fine
inference by SMC, where the nth distribution in
the sequence is simply the state space of the n-coarsest
levels; this is correct inference for the original model
because, by construction, the marginal distribution
over the finest level is the original distribution.

Model transformations let us directly use existing sequential
inference algorithms to perform coarse-to-fine
inference, rather than proposing a new inference algorithm
per se. This is in contrast to essentially all
prior work on coarse-to-fine inference, including Kiddon
and Domingos (2011) and Steinhardt and Liang (2014).
One benefit of this modular approach is that advances
in SMC algorithms immediately yield improvements to
coarse-to-fine inference. Another benefit is the conceptual
clarity that comes from an explicit representation
of the coarse-to-fine model.


In the following, we first review probabilistic programs
and Sequential Monte Carlo. We then describe our
coarse-to-fine program transform and how it lifts random
variables, primitive functions, and factors to operate
on multiple levels of abstraction. We apply this
transform to two models in the domain of tracking
partially observable objects over time given visual information,
a depth-from-disparity model and a factorial
hidden Markov model, and show preliminary evidence
that it may help reduce inference time in these domains.
Finally, we discuss the current limitations of
this framework, the circumstances where our approach
to coarse-to-fine inference is a good fit, and outline
research questions raised by this new approach.


2 BACKGROUND 


2.1 PROBABILISTIC PROGRAMS 


Probabilistic programs are models expressed in Turing-complete
languages that supply primitives for random
sampling and probabilistic inference (e.g., Goodman
et al. 2008; Koller et al. 1997; Pfeffer 2007). Many
existing probabilistic models have been expressed concisely
as probabilistic programs. A distinguishing feature
of probabilistic programming as a machine learning
technique is that it separates inference techniques from
modeling assumptions. Thus, any advances in algorithms
provide benefits for a wide range of applications
at once. While we demonstrate our technique for a
small set of models chosen for their pedagogical value,
we emphasize that the technique can be applied to a
much wider range of models without modification.


We express probabilistic programs in WebPPL (Goodman
and Stuhlmüller 2014), a small probabilistic
language embedded in JavaScript. This language is
universal and feature-rich, so we expect the techniques
to generalize straightforwardly to other languages.
In this language, all random choices are marked
by sample; the argument to sample is a distribution
object (also called Elementary Random Primitive, or
ERP), its return value a sample from this distribution.
Calls to functions such as flip(0.5) are shorthand for
sample(bernoulliERP, [0.5]).


To enable probabilistic conditioning, the language supports
factor statements. The argument to factor is a
score: a number that is added to the log-probability
of a program execution, thus increasing or decreasing
its relative posterior probability. This includes hard
conditioning on evidence as a special case (scores 0
and $-\infty$). Finally, the language supports inference
primitives such as ParticleFilter and MH (Metropolis-Hastings).
Each of these takes as an argument a thunk,
that is, a stochastic function that itself takes no arguments.
Each of these computes or estimates the
distribution on return values of this thunk (its marginal
distribution), taking into account the re-weighting induced
by factor statements.

Figure 2a shows a program that implements a simplified
one-step version of multiple object tracking: the
noisily observed value 7 could have been produced by
either x or y, each of which is uniformly chosen from

var noisyObserve = function(obs){
  var score = -3*distance(obs, 7)
  factor(score)
}

var model = function(){
  var x = sample(uniformERP)
  var y = sample(uniformERP)
  var observation = flip(.5) ? x : y
  noisyObserve(observation)
  return [x, y]
}

(a) A probabilistic program


var noisyObserve = function(obs){
  var score = -3*distance(obs, 7)
  factor(score)
}

var model = function(){
  var x = sample(uniformERP)
  var heuristicScore = -distance(x, 7)
  factor(heuristicScore)
  var y = sample(uniformERP)
  var observation = flip(.5) ? x : y
  noisyObserve(observation)
  factor(-heuristicScore)
  return [x, y]
}

(b) Rewritten using heuristic factors


Figure 2: Two probabilistic programs with the same marginal distribution (shown in the final panel in Figure 1).


{1, 2, ..., 8}. The final panel in Figure 1 shows the
marginal distribution on [x, y] for this program.
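Because the state space of this program is small, its marginal distribution can be computed by exhaustive enumeration. The following is a minimal sketch in plain JavaScript (not WebPPL); it assumes that distance is the absolute difference, which the text does not specify, and the name enumerateModel is introduced only for illustration.

```javascript
// Plain-JavaScript sketch: exact enumeration of the one-step tracking model.
// Assumption: distance(a, b) is the absolute difference |a - b|.
var distance = function (a, b) { return Math.abs(a - b); };

var enumerateModel = function () {
  var weights = {};   // unnormalized posterior weight per return value "x,y"
  var total = 0;
  for (var x = 1; x <= 8; x++) {
    for (var y = 1; y <= 8; y++) {
      [true, false].forEach(function (coin) {
        var observation = coin ? x : y;
        // Prior of one execution: 1/8 * 1/8 * 1/2.
        // The factor statement adds -3 * distance to the log-weight.
        var w = (1 / 128) * Math.exp(-3 * distance(observation, 7));
        var key = x + ',' + y;
        weights[key] = (weights[key] || 0) + w;
        total += w;
      });
    }
  }
  // Normalize to obtain the marginal distribution on [x, y].
  var marginal = {};
  Object.keys(weights).forEach(function (k) {
    marginal[k] = weights[k] / total;
  });
  return marginal;
};

var marginal = enumerateModel();
```

Under these assumptions, mass concentrates on executions where x or y is close to 7, since either variable may have produced the observation.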

In probabilistic programs, the same syntactic variable 
can be used multiple times. The prototypical example 
is the geometric distribution: 

var geometric = function() {
  return flip(0.1) ? 0 : 1 + geometric()
}


The call to flip(0.1) may occur an unbounded number
of times. For many purposes, it is necessary to
distinguish and refer to these different calls. In the
context of MCMC, Wingate et al. (2011) introduced a
suitable naming scheme based on stack addresses. The
address of a random choice is a list of syntactic locations,
one for each function on the function call stack
at the time when the random variable was sampled.
We will build on this scheme to associate corresponding
random choices on different levels of coarsening with
each other, and use address in the following to refer
to the current stack address.
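To make the naming scheme concrete, the sketch below tracks a call stack by hand in plain JavaScript and records the address of every flip in the geometric program. The helpers makeTracer, at, and flip(site) are hypothetical and are not WebPPL internals; random draws are replaced by a scripted list of outcomes so that the run is reproducible.

```javascript
// Sketch of address-based naming (hypothetical helpers, not WebPPL internals).
var makeTracer = function (outcomes) {
  var stack = [];   // syntactic locations of the currently active calls
  var trace = [];   // addresses of random choices, in order of occurrence
  var next = 0;
  var at = function (site, thunk) {   // run thunk "at" a named call site
    stack.push(site);
    var result = thunk();
    stack.pop();
    return result;
  };
  var flip = function (site) {
    return at(site, function () {
      trace.push(stack.join('/'));    // address = list of sites on the stack
      return outcomes[next++];        // scripted outcome instead of a random draw
    });
  };
  return { at: at, flip: flip, trace: trace };
};

var t = makeTracer([false, false, true]);
var geometric = function () {
  return t.flip('flip') ? 0 : 1 + t.at('rec', geometric);
};
var value = t.at('top', geometric);
// t.trace is ['top/flip', 'top/rec/flip', 'top/rec/rec/flip'] and value is 2
```

Although all three choices come from the same syntactic call to flip, each receives a distinct address because the stack differs at the time of sampling.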


2.2 SEQUENTIAL MONTE CARLO 

Suppose our target distribution is X with probability
mass function p. Importance sampling generates
samples from an approximating distribution Y (with
probability mass function q) and re-weights the samples
to account for the difference between the true and
approximating distributions, using the weights $w(x) = p(x)/q(x)$.
To compute estimates of $\psi = E_{x \sim X}[f(x)]$ given
samples $y_1, \ldots, y_M$ from Y, we use the standard
importance-sampling estimator

$$\hat{\psi} = \frac{1}{M} \sum_{i=1}^{M} w(y_i)\, f(y_i).$$

To generate approximate samples from $p(x)$, we resample
from the set of samples in proportion to the
importance weights.
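The re-weighting step can be sketched in a few lines of plain JavaScript. The function importanceEstimate and the toy distributions below are illustrative assumptions, not part of the original system; the "samples" are a perfectly stratified set from the uniform proposal so that the result is deterministic.

```javascript
// Importance-sampling estimate of E_p[f] from samples drawn from q.
var importanceEstimate = function (samples, p, q, f) {
  var sum = 0;
  samples.forEach(function (y) {
    var w = p(y) / q(y);   // importance weight w(y) = p(y) / q(y)
    sum += w * f(y);
  });
  return sum / samples.length;
};

// Toy target p on {0, 1, 2, 3} and a uniform proposal q (illustrative values).
var pTable = [0.1, 0.2, 0.3, 0.4];
var p = function (x) { return pTable[x]; };
var q = function (x) { return 0.25; };
var f = function (x) { return x; };

// A perfectly stratified "sample" from q: each value appears equally often.
var estimate = importanceEstimate([0, 1, 2, 3], p, q, f);
```

With real random samples the estimate fluctuates around the true expectation; the closer q is to p, the smaller the variance of the weights.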


If we iterate this procedure with a sequence of approximating
distributions $q_1, \ldots, q_k$, we get Sequential
Importance Sampling. If we resample at each stage, we
get Sequential Importance Resampling. If we additionally
apply MCMC “rejuvenation” steps at each stage $i$
with a transition kernel that leaves the distribution $q_i$
invariant, we get Sequential Monte Carlo.


For Sequential Importance Sampling, the sum of the
KL divergences between successive distributions controls
the difficulty of sampling (Freer et al., 2010). If
we can sample from the right coarse-grained distributions,
we can reduce this difficulty, as illustrated in
Figure 1. With rejuvenation steps (SMC), the picture
is more complex, but empirically, it is still the case
that distributions that are closer together in KL generally
make the sampling problem easier. In particular,
we expect that good coarse-to-fine sequences lead to
better coverage of regions with high posterior probability,
and that they enable more efficient pruning of
low-probability regions. A finite set of fine-grained particles
may not cover the entire region, which can lead to
a situation where all particles assign low probability to
the next filtering step (particle decay). A particle that
has not been refined yet corresponds to distributions
on fine-grained states, thus each such particle can cover
a bigger region (Steinhardt and Liang 2014). Good
coarse-to-fine sequences can allow us to prune entire
parts of the state space in one go, only considering refinements
of abstract states that have sufficiently high
posterior probability (Kiddon and Domingos 2011).


3 ALGORITHM 

Given a probabilistic program, our algorithm builds
a coarse-to-fine program with the same marginal distribution
as the original program, but with additional
latent structure corresponding to coarsened versions of
the program.

We will assume that the user provides a coarsenValue
function that describes how values map to more abstract
values. Iterating this function leads to multiple
levels of coarsened values. Our goal then is to construct
a version of the original program that operates over
values coarsened N times. We will preserve the basic
flow structure of the program, and thus we only need
to specify how each primitive construct in the program
is lifted to the space of coarsened values. The tricky
part is to construct these lifted components such that
the final marginal distribution is preserved. We use
two ideas to accomplish this. First, we replace each
unconditional elementary distribution at a given location
with a distribution that depends on the coarser
value of the same location, but such that the marginal
over this coarser value yields the original distribution.
Second, we treat lifted factors as only approximations
useful for guiding inference, which are then canceled by
an extra factor inserted at the next-finer level. With
this scheme, only the finest-level factors contribute to
the final score. This gains us flexibility over the lifting
of primitive functions: lifted functions (that ultimately
flow only to factor statements) only need to have similar
behavior to their original; deviations will be corrected
by the cancellation of factors.
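The cancellation scheme for factors can be illustrated with a small plain-JavaScript sketch. The per-level scores below are made-up numbers, and the two-argument liftedFactor(s, level) with an explicit level is a simplification for illustration; it accumulates the log-weight directly instead of calling an inference engine.

```javascript
// Hypothetical per-level scores for one factor location: index = level
// (0 is the finest level, 2 the coarsest). Only scores[0] is the true score.
var scores = [-6.0, -4.5, -1.0];

var store = {};
var totalLogWeight = 0;
var liftedFactor = function (s, level) {
  var coarser = store[level + 1] || 0;   // score used one level up, if any
  totalLogWeight += s - coarser;          // add the new score, cancel the old one
  store[level] = s;
};

// Run from coarse to fine, as the transformed program would.
for (var level = scores.length - 1; level >= 0; level--) {
  liftedFactor(scores[level], level);
}
// totalLogWeight now equals scores[0], the finest-level score
```

The sum telescopes: each coarse score is added once and subtracted once, so the coarse scores only influence intermediate weights, not the final one.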

In the next few subsections, we introduce heuristic 
factors, the inputs that the program transform requires, 
how the model syntax is transformed, and how each of 
the components of the lifted model works: constants, 
random variables, factors, and primitive functions. 


3.1 HEURISTIC FACTORS 


A heuristic factor is a factor that is introduced for the
purpose of guiding incremental inference algorithms
such as particle filtering and best-first enumeration
(Goodman and Stuhlmüller, 2014). Its distinguishing
characteristic is that an equivalent, canceling factor
is inserted at a later position in the program in order
to leave the program’s distribution invariant. In other
words, the pair of statements factor(s) and factor(-s)
together has no effect on the meaning of a model; its
only effect is in controlling how inference algorithms
explore the state space.


For example, Figure 2b shows a way to rewrite the
program in Figure 2a so that it initially assigns
higher weight to program executions where x is close
to the true observation 7. This is a heuristic since,
depending on the outcome of the coin flip, it may be
y which is observed, in which case there is no pressure
for x to be close to 7.
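The effect of the heuristic factor on a single execution can be checked directly. The sketch below, in plain JavaScript, computes the accumulated factor score of the original and the rewritten program for given values of x, y, and the coin; distance is assumed to be the absolute difference, and the function names are illustrative.

```javascript
var distance = function (a, b) { return Math.abs(a - b); };  // assumed metric

// Log-weight contributed by factor statements in the original program.
var scoreOriginal = function (x, y, coin) {
  return -3 * distance(coin ? x : y, 7);
};

// Log-weights in the rewritten program: one value right after sampling x,
// and the total once the canceling factor has been applied.
var scoreWithHeuristic = function (x, y, coin) {
  var heuristicScore = -distance(x, 7);
  var afterX = heuristicScore;
  var total = afterX + scoreOriginal(x, y, coin) - heuristicScore;
  return { afterX: afterX, total: total };
};
```

The intermediate value afterX is what an incremental algorithm such as a particle filter sees when it decides which partial executions to keep; the total is identical to the score of the original program.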


The coarse-to-fine transform introduces heuristic factors
that guide sampling on coarse levels towards high-probability
regions of the state space without changing
the program’s distribution.


3.2 PREREQUISITES 

The main inputs to the transform are a model, given as 
code for a probabilistic program, and a pair of functions 
coarsenValue and refineValue. 

The main constraint on the model is that all ERPs
need to be independent, i.e., they must not take parameters
that depend on other ERPs. If the support of each
ERP is known, this can be achieved using a simple
transform that replaces each dependent ERP with a
maximum-entropy ERP, and adds a dependent factor
that corrects the score. That is, we transform

var x = sample(originalERP, params)

to

var x = sample(maxentERP)
factor(originalERP.score(x, params) -
       maxentERP.score(x))


This transform leaves the model’s distribution unchanged
and greatly simplifies the coarsening of ERPs,
but reduces the statistical efficiency of the model.
This statistical inefficiency can potentially be addressed
by merging sample and factor statements into
sampleWithFactor (Goodman and Stuhlmüller 2014)
after the coarse-to-fine transform.
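That this rewrite preserves the distribution can be verified by enumeration on a small support. The sketch below is plain JavaScript with an illustrative three-element support; originalScore and maxentScore stand in for the score methods of the two ERPs and are assumptions made for the example.

```javascript
// Illustrative dependent distribution on the support {0, 1, 2}.
var support = [0, 1, 2];
var originalScore = function (x, params) { return Math.log(params[x]); };
var maxentScore = function (x) { return Math.log(1 / support.length); };

// Exact distribution of: x ~ maxent; factor(original.score - maxent.score)
var transformedDistribution = function (params) {
  var weights = support.map(function (x) {
    var prior = Math.exp(maxentScore(x));
    var correction = Math.exp(originalScore(x, params) - maxentScore(x));
    return prior * correction;
  });
  var total = weights.reduce(function (a, b) { return a + b; }, 0);
  return weights.map(function (w) { return w / total; });
};

var recovered = transformedDistribution([0.7, 0.2, 0.1]);
```

The uniform prior and the correction factor multiply to the original probability, so recovered matches the parameters up to floating-point error. The cost is that samples are proposed uniformly and only later re-weighted, which is the inefficiency mentioned above.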


The model is annotated with the name of the main
model function (which defines the marginal distribution
of interest) and a list of names of ERPs, constants,
and functions (compound, primitive, score, and polymorphic;
see below) to be lifted.

The main parameters that control the coarse distributions
are the user-specified functions coarsenValue and
refineValue. The function coarsenValue maps a value
to a coarser value; the function refineValue maps a
coarse value to a set of finer values. To generate values
on abstraction level i, we iterate the coarsenValue
function i times. We require that the two functions
are inverses in the sense that
$v \in \mathrm{refineValue}(V) \Leftrightarrow \mathrm{coarsenValue}(v) = V$
for all $v$ and $V$.
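As an illustration, one possible pair of functions for positive integers merges neighboring values pairwise. The sketch below is plain JavaScript; the particular coarsening and the checker isInversePair are assumptions chosen for the example, not the functions used in the experiments.

```javascript
// Example pair for integer values 1, 2, 3, ...: merge neighbors pairwise.
var coarsenValue = function (v) { return Math.ceil(v / 2); };
var refineValue = function (V) { return [2 * V - 1, 2 * V]; };

// Check: v is in refineValue(V) exactly when coarsenValue(v) equals V.
var isInversePair = function (fineValues, coarseValues) {
  return fineValues.every(function (v) {
    return coarseValues.every(function (V) {
      var inRefinement = refineValue(V).indexOf(v) !== -1;
      return inRefinement === (coarsenValue(v) === V);
    });
  });
};

var fine = [1, 2, 3, 4, 5, 6, 7, 8];
var coarse = [1, 2, 3, 4];
var ok = isInversePair(fine, coarse);
```

Iterating this coarsenValue on the domain {1, ..., 8} yields levels with 4, 2, and finally 1 value, so the observed value 7 becomes 4, then 2, then 1.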

If a model has multiple different types of variables for
which inference is needed, it is easy to define a polymorphic
coarsening function that implements different
behaviors for different types of values. In cases where
different variables have the same type but different
meaning or scale, we recommend using wrapper types
to control which coarsening is used.

This paper does not address the task of finding good
value coarsening functions. Instead, we ask: if such a

var liftedUniformERP = liftERP(uniformERP)
var liftedDistance = liftScorer(distance)

var noisyObserve = function(obs){
  var score = -3*liftedDistance(obs, 7)
  liftedFactor(score)
}

var model = function(){
  store.base = getStackAddress()
  var x = liftedUniformERP()
  var y = liftedUniformERP()
  var observation = flip(0.5) ? x : y
  noisyObserve(observation)
  return [x, y]
}

var coarseToFineModel = function(level){
  store.level = level
  var marginalValue = model()
  if (level === 0) {
    return marginalValue
  } else {
    return coarseToFineModel(level - 1)
  }
}

Figure 3: The coarse-to-fine model corresponding to the
model in Figure 2a. This model has the same marginal
distribution as the programs in Figures 2a and 2b, but samples
it using the hierarchical process shown in Figure 1.

function is given, how can we use it to coarsen entire 
programs so that we produce a sequence of coarse-to- 
fine models that is useful for SMC? 

3.3 MODEL TRANSFORM 

The transform adds a wrapper coarseToFineModel that 
calls the model once for each coarsening level, from 
coarse to fine, each time setting the (dynamically 
scoped) variable store.level (in the following, level) 
to the current level. The transform also replaces all 
ERPs, factors, primitive functions, and score functions 
with lifted versions that act differently depending on 
level. The coarse models only affect the fine-grained
models through the side-effect of storing the values of
their random choices and the weights of their factors
in store, which is used by the finer-grained models to
conditionally sample their random choices and compute
their factor weights.

The syntactic transform itself proceeds as follows:

1. For each ERP, primitive and score function, insert 
the corresponding lifted definition before the model 
definition. For example: 
var liftedPlus = liftPrimitive(plus) 


2. Rename all ERPs, factors, primitive and score 
functions to their corresponding lifted names in 
model and compound functions. For example, 
replace plus with liftedPlus. 

3. Wrap all constants. For example, replace c with 
liftConstant(c). 

4. As the first statement in the model, store the current
address, which is needed to compute relative
addresses of random choices and factors later on:
store.base = getStackAddress()

5. Add a wrapper coarseToFineModel that calls the
model once for each coarsening level (see Figure 3).

We will now describe the mechanisms behind lifted 
constants, random variables, factors, and primitive 
functions. 

3.3.1 Lifting Constants 

To lift a constant, we simply repeatedly coarsen it to 
the current level: 


Algorithm 2: Lifting constants

procedure liftConstant(c)
    for i = 0; i < level; i++ do
        c = coarsenValue(c)
    end for
    return c
end procedure


3.3.2 Lifting Random Variables 

We lift each random variable to a sequence of variables, 
from coarse to fine, such that (1) on the coarsest level, 
we unconditionally sample from the original distribu¬ 
tion, coarsened N times; (2) at each level n < N, we 
sample from the set of valid refinements of the value 
of the next-coarser variable; (3) marginalizing out all 
coarser variables, the distribution at the finest level 
recovers the distribution of the original uncoarsened 
variable. 

Let $D_0$ be the domain of an ERP with distribution
$p_0(x)$, and let $D_n = \mathrm{cv}^n(D_0)$ be the set of values arrived
at by repeatedly applying the coarsenValue function
(written cv for short). We would like to decompose the
original distribution $p_0(x)$ into a sequence of conditional
distributions $q(x_n \mid x_{n+1})$ for random variables $x_n \in D_n$.
If we take:

$$q(x_n \mid x_{n+1}) \propto \sum_{x_0} p_0(x_0)\, \delta_{\mathrm{cv}^n(x_0) = x_n \,\wedge\, \mathrm{cv}(x_n) = x_{n+1}}$$

Algorithm 1: Lifting ERPs

procedure sampleLiftedERP(e0, l)
    v1 = store[erpName(address, l + 1)]
    if v1 is undefined then
        if l is 0 then
            return e0.sample()
        else
            return coarsenValue(sampleLiftedERP(e0, l - 1))
        end if
    else
        V = refineValue(v1)
        p = V.map(λ(v){return getERPScore(e0, v, l)})
        return sampleDiscrete(V, p)
    end if
end procedure

procedure liftERP(e0)
    e1 = makeERP(λ(){sampleLiftedERP(e0, level)})
    return λ(){
        v = sample(e1)
        store[erpName(address, level)] = v
        return v
    }
end procedure


and for the coarsest level, $N$,

$$q(x_N) = \sum_{x_0} p_0(x_0)\, \delta_{\mathrm{cv}^N(x_0) = x_N}$$

then it is clear that we preserve the marginal distribution
on $x_0$. That is:

$$p_0(x_0) = \sum_{x_1, \ldots, x_N} q(x_0 \mid x_1) \cdots q(x_{N-1} \mid x_N)\, q(x_N).$$
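The marginal-preservation claim can be checked numerically on a small example. The following plain-JavaScript sketch uses an illustrative coarsening (pairwise merging of integers) and an illustrative non-uniform base distribution; the names mass, qCond, and chainProbability are introduced here and are not part of the original system. Because each fine value has exactly one coarse ancestor per level, the sum over coarse values reduces to a single chain.

```javascript
var cv = function (v) { return Math.ceil(v / 2); };   // example coarsenValue
var cvN = function (n, v) {                            // apply cv n times
  for (var i = 0; i < n; i++) { v = cv(v); }
  return v;
};

var domain = [1, 2, 3, 4, 5, 6, 7, 8];
var p0 = function (x) { return x / 36; };              // illustrative base distribution

// mass(n, xn): total base mass of values that coarsen to xn after n steps.
var mass = function (n, xn) {
  return domain.reduce(function (acc, x0) {
    return acc + (cvN(n, x0) === xn ? p0(x0) : 0);
  }, 0);
};

// q(x_n | x_{n+1}): mass of x_n renormalized among the refinements of x_{n+1}.
var qCond = function (n, xn, xn1) {
  return cv(xn) === xn1 ? mass(n, xn) / mass(n + 1, xn1) : 0;
};

// Product q(x_0|x_1) ... q(x_{N-1}|x_N) q(x_N) along the coarsening chain of x0.
var chainProbability = function (x0, N) {
  var prob = 1;
  for (var n = 0; n < N; n++) {
    prob *= qCond(n, cvN(n, x0), cvN(n + 1, x0));
  }
  return prob * mass(N, cvN(N, x0));
};
```

The product telescopes, so chainProbability(x0, N) agrees with p0(x0) for every base value, up to floating-point error.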


Algorithm 1 shows how we implement sampling from 
such a decomposed ERP at a given level. Note that, to 
look up the existing value at the next-coarser level, we 
identify random variables on different levels based on 
relative stack addresses (via erpName); this is critical for 
models such as grammars with an unbounded number 
of random choices. The implementation for computing 
the score of lifted ERPs is analogous (although this 
score is not needed for pure particle filtering, so we 
omit the details). Both are parameterized by a function 
getERPScore that estimates the total probability of the 
equivalence class of values that map to a given coarse 
value. These ERP scores for coarse values can be 
estimated by sampling refinements, via user-specified
scoring functions (as in Section 4.1), or using exact
computation (as in Section 4.2).
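A runnable approximation of this sampling scheme is sketched below in plain JavaScript. It makes several simplifying assumptions that are not in the original: the ERP is uniform on {1, ..., 8}, coarsening merges integers pairwise, getERPScore is computed exactly and returns a probability rather than a log-score, and a string name stands in for the relative stack address.

```javascript
var coarsenValue = function (v) { return Math.ceil(v / 2); };   // example coarsening
var refineValue = function (V) { return [2 * V - 1, 2 * V]; };
var coarsenTimes = function (n, v) {
  for (var i = 0; i < n; i++) { v = coarsenValue(v); }
  return v;
};

var support = [1, 2, 3, 4, 5, 6, 7, 8];
var baseProb = function (x) { return 1 / support.length; };     // uniform base ERP

// Exact score: mass of base values whose coarsening at `level` equals v.
var getERPScore = function (v, level) {
  return support.reduce(function (acc, x0) {
    return acc + (coarsenTimes(level, x0) === v ? baseProb(x0) : 0);
  }, 0);
};

var sampleDiscrete = function (values, probs) {
  var total = probs.reduce(function (a, b) { return a + b; }, 0);
  var u = Math.random() * total, acc = 0;
  for (var i = 0; i < values.length; i++) {
    acc += probs[i];
    if (u < acc) { return values[i]; }
  }
  return values[values.length - 1];
};

// `name` stands in for the relative stack address of the random choice.
var sampleLifted = function (store, name, level) {
  var coarser = store[name + '@' + (level + 1)];
  var value;
  if (coarser === undefined) {
    // Coarsest level: coarsen a sample from the base distribution.
    value = coarsenTimes(level, sampleDiscrete(support, support.map(baseProb)));
  } else {
    // Finer level: choose among the refinements of the stored coarser value.
    var candidates = refineValue(coarser);
    value = sampleDiscrete(candidates, candidates.map(function (v) {
      return getERPScore(v, level);
    }));
  }
  store[name + '@' + level] = value;
  return value;
};

var runCoarseToFine = function (topLevel) {
  var store = {}, values = [];
  for (var level = topLevel; level >= 0; level--) {
    values.push(sampleLifted(store, 'x', level));
  }
  return values;   // [coarsest, ..., finest]
};
```

Every run of runCoarseToFine(2) produces a chain in which each value is a valid refinement of the one before it, and the last entry is a base-level value.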


3.3.3 Lifting Factors 


We treat the lifted counterparts of factors as heuristic
factors: the score on the next-higher level is subtracted
out on the current level. By canceling out these factors
when inference proceeds to finer-grained levels,
we ensure that the overall distribution of the model
remains unchanged; ultimately, only the base-level
factors count. This incremental scoring process formalizes
the intuition of increasing attention to detail as
we move down the abstraction ladder. Like random
variables, we identify factors based on relative stack
addresses.


Algorithm 3: Lifting factors

procedure liftedFactor(s)
    s1 = store[factorName(address, level + 1)] ∨ 0
    factor(s - s1)
    store[factorName(address, level)] = s
end procedure


3.3.4 Lifting Primitive Functions

When a primitive function f is applied to a base-level
value, it is deterministic. Now we are interested in
lifting primitive functions to operate on coarse values.
However, each coarse value corresponds to a set of
base-level values. For different elements in this set, f
may return different values. This suggests that lifted
versions of f may be stochastic. We wish to preserve
the marginal distribution of the entire program, but
because we have required ERPs to be unconditional
and treat coarse factors as canceling heuristics, we have
some latitude in how to lift the primitive functions.

Algorithm 4 shows one approach to the computation
of such coarsened primitives. This algorithm is parameterized
by a function marginalize, which may be
implemented using exact computation, sampling, etc.,
and which may cache its computations.


Algorithm 4: Lifting primitive functions

procedure liftPrimitive(f)
    return λ(x){
        e = marginalize(λ(){
            x0 = x.map((uniformDraw ∘ refineValue)^level)
            v = f(x0)
            return coarsenValue^level(v)
        })
        return sample(e)
    }
end procedure


Scoring functions (those that directly compute a score
to be consumed by factor) are a special case of primitive
functions. We know that they return a number,
so instead of sampling from the return distribution,
we can simply take the expectation. We find that this
leads to more stable, and hence useful, heuristic factors.
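One way to realize the expectation-based lifting of scoring functions is sketched below in plain JavaScript. It is an illustration under stated assumptions: both arguments are coarse values at the same level, the level is passed explicitly instead of being read from store.level, refinements are weighted uniformly, and the expectation is computed by exhaustive enumeration. The names refinements and this particular liftScorer signature are hypothetical.

```javascript
var refineValue = function (V) { return [2 * V - 1, 2 * V]; };  // example refinement

// All base-level values represented by coarse value v at the given level.
var refinements = function (v, level) {
  var values = [v];
  for (var i = 0; i < level; i++) {
    values = values.reduce(function (acc, u) {
      return acc.concat(refineValue(u));
    }, []);
  }
  return values;
};

// Lift a two-argument score function by averaging over uniform refinements.
var liftScorer = function (f) {
  return function (a, b, level) {
    var as = refinements(a, level), bs = refinements(b, level), sum = 0;
    as.forEach(function (x) {
      bs.forEach(function (y) { sum += f(x, y); });
    });
    return sum / (as.length * bs.length);
  };
};

var distance = function (a, b) { return Math.abs(a - b); };     // assumed metric
var liftedDistance = liftScorer(distance);
```

For example, at level 1 the coarse value 4 stands for {7, 8}, so liftedDistance(4, 4, 1) averages the distances 0, 1, 1, 0. At level 0 the lifted function coincides with the original.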


Algebraic data type constructors are another special
case. In many circumstances (including Section 4.2),
they can be treated as transparent with respect to
coarsening. For example, it is frequently useful to
define coarsenValue([x1, x2, ...]) as equivalent to
[coarsenValue(x1), coarsenValue(x2), ...]. More generally,
if a function can apply to coarsened objects
directly, it can be marked as polymorphic. In that
case, no lifting is necessary. For example, if we have
a function that computes the mean of a list of numbers,
and if our coarsening maps lists of numbers to
shorter lists (without changing the type of the list elements),
then we have the option to mark this function
as polymorphic.


4 EMPIRICAL EVALUATION 


We introduced two complementary ideas: (1) Coarse-to-fine
SMC (which can be useful even for a single random
variable with a big state space, given an appropriate
coarsening function) and (2) constructing coarse-to-fine
models from probabilistic programs (using a value
coarsening function to lift each of the components of
a probabilistic program to operate on coarsened values
without changing the marginal distribution of the
program).

Our experiments mirror this structure: In the first set of 
experiments, we evaluate the benefits of coarse-to-fine 
SMC on an Ising model and its depth-from-disparity 
variation. These models use a single matrix-valued 
random variable and a single global factor (the energy 
function). 

In the second set of experiments—on a Factorial Hidden 
Markov Model—we make liberal use of probabilistic 
program constructs: the model is defined as a recursive 
stochastic function, and the state transition function 
is implemented using the higher-order function map. 
This set of experiments fully exercises our program 
transform, including the identification of corresponding 
coarse and fine random variables using relative stack 
addresses. 


All models are expressed as programs, and all experiments
are implemented within the same framework. For
each experiment, we show how the average importance
weight, essentially a lower bound on the normalization
constant (Grosse 2013), behaves over time.


Note that these experiments are preliminary: They do
not yet employ rejuvenation steps, and hence only test
the quality of the sequence of coarse-to-fine distributions
in the setting of Sequential Importance Resampling.
While the use of a coarse-to-fine sequence of
importance distributions results in better performance
than baseline, we have only evaluated the performance
compared to a relatively weak baseline (importance
sampling for the MRF models, particle filtering for the
Factorial HMM model), and expect that the results
are not impressive on an absolute scale. Indeed, as
described later in the paper, we believe that some modifications
and further developments are needed to make
this approach useful in practice.


4.1 MARKOV RANDOM FIELDS 


A number of applications in physics, biology, and computer
vision can be modeled as Markov Random Fields
(MRFs). These problems are unified in specifying a
global energy function which, by virtue of the Markov
property, depends only on the local neighborhoods of
elements. Once this energy function is specified, it
can be difficult to minimize; specialized optimization
algorithms have been developed for particular domains
(Szeliski et al. 2008), but there is no generally applicable
solution.


The local neighborhood structure of the energy function,
however, suggests that a coarse-to-fine transformation
may be useful: if neighborhoods are coarsened
into single representative values, then the energy can be
minimized in this smaller space, using heuristic factors
to guide search in the original space. We do not claim
that our model transformation constitutes a solution in
itself, but it can be used in tandem with other algorithms
to effectively reduce the search space. In this situation,
we demonstrate the coarse-to-fine transformation
on two simple MRFs: the Ising model and the stereo
matching task.


4.1.1 Ising model 


Coarse-to-fine transformations have a long history of
applications in physics. When studying systems which
interact across multiple orders of magnitude, such as
fluids, ferromagnets, and metal alloys, it is intractable
to work at the most fine-grained level. Since exact
solutions do not exist, physicists developed a method
called the renormalization group (Wilson 1975, 1979),
which effectively maps the fine-grained representation
of a system onto a coarser but identically parameterized
representation with similar properties.


One of the simplest testbeds for renormalization group
methods is the 2-dimensional Ising model. The state
space is an $n \times n$ lattice of cells, each of which can take
one of two spin values, $\sigma_i \in \{-1, +1\}$. The energy of
a particular configuration of spins $\sigma$ is given by the
Hamiltonian:

$$\mathcal{H}(\sigma) = J \sum_{\langle i j \rangle} \sigma_i \sigma_j$$

where $J$ is the interaction constant and $\langle i j \rangle$ indicates
summing over all possible pairs of neighbors. Note
that the number of possible configurations grows exponentially
in n, rendering an exhaustive search for
low-energy states impossible.

Figure 4: Quantitative inference results for Markov Random Field models.
Panels: (a) Ising at low temperature (n = 27, T = 1);
(b) Ising at the critical state (n = 27, T = 2.39);
(c) Depth-from-disparity (n = 10, m = 40, d = 15).
Horizontal axis: seconds.


The interaction constant can be written as J = 1/T
where T is the temperature of the system. The configuration
distribution takes different forms at different
temperatures: we will conduct experiments at T = 1, a
low-temperature condition where spins prefer to globally
align, and T = 2.39, the critical temperature, where
long-range correlations dominate. Above the critical
temperature (e.g. for T > 3), cells become uncoupled
and the energy distribution across configurations converges
to uniform, so we focus on lower temperatures.

We implemented the Ising energy-minimization problem
as a simple probabilistic program, which first samples
a set of spins and then factors based on the energy
of that configuration. To apply our coarse-to-fine
transformation, we used the spin-block majority rule for
our coarsenValue and refineValue functions. To
coarsen, this rule replaces each 3x3 sub-lattice with
its modal value (see Figure [?]). To refine a single cell
in the coarse matrix, we consider the space of all
256 possible 3x3 matrices that could coarsen to that
value. Note that our sequential refinement, which makes
many small choices instead of one big one, differs from
the typical renormalization group approach, which
simultaneously replaces all sublattices and reweights the
interaction constant J accordingly.

For example, consider a fine-grained value such as


0 1 0 0 0 0
1 1 0 1 0 0
1 1 1 1 0 0
0 1 1 0 0 0
0 0 0 0 0 1
0 1 1 0 1 1


(We use spins {1, 0} here instead of {1, −1} to make the
matrices more readable.)


Coarser values are partially-coarsened matrices, which
can be represented as pairs of matrices,
where each entry in the second matrix corresponds to
a 3x3 block in the first matrix. In each coarsening step,
we only coarsen a single such block. As a consequence,
there are 90 (= 81 + 9) steps when we incrementally
coarsen a 27x27 matrix first to 9x9, then to 3x3.
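
The majority rule and the step count can be sketched as follows. This is an illustrative sketch rather than the paper's code: the names `coarsen_value`, `refine_value`, `coarsen_lattice`, and `incremental_steps` are hypothetical, spins are written as {1, 0}, and `coarsen_lattice` coarsens every block at once purely to check the rule, whereas the transformation described above coarsens one block per step.

```python
from itertools import product


def coarsen_value(block):
    """Majority (modal) value of a 3x3 block with entries in {0, 1}."""
    ones = sum(sum(row) for row in block)
    return 1 if ones >= 5 else 0


def refine_value(coarse):
    """Enumerate every 3x3 block whose majority value equals `coarse`."""
    blocks = []
    for bits in product((0, 1), repeat=9):
        block = [list(bits[0:3]), list(bits[3:6]), list(bits[6:9])]
        if coarsen_value(block) == coarse:
            blocks.append(block)
    return blocks


def coarsen_lattice(lattice):
    """Apply the majority rule to every 3x3 block at once (for illustration only)."""
    n = len(lattice)
    return [
        [coarsen_value([row[c:c + 3] for row in lattice[r:r + 3]])
         for c in range(0, n, 3)]
        for r in range(0, n, 3)
    ]


def incremental_steps(n, target=3):
    """Count single-block coarsening steps from an n x n to a target x target lattice."""
    steps = 0
    while n > target:
        n //= 3
        steps += n * n  # one step per cell of the next-coarser lattice
    return steps


fine = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 1, 1, 0, 1, 1],
]
print(coarsen_lattice(fine))   # 2 x 2 matrix of block majorities
print(len(refine_value(1)))    # number of 3x3 blocks with majority 1
print(incremental_steps(27))   # 81 + 9 single-block steps
```

Because a 3x3 block has an odd number of cells, the majority is always well defined, and exactly half of the 512 possible blocks refine each coarse value.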


To facilitate this sequential refinement, we implemented
a polymorphic energy function, which can directly return
a score for such partially coarsened matrices without
needing to refine all the way down to the most
fine-grained level. When this energy function is applied
to a matrix that contains cells at different levels of
abstraction, the energy computation counts a single
coarse cell as a neighbor for each surrounding finer cell
(and vice versa).


We ran two experiments on 27 × 27 lattices to demonstrate
the performance of our coarsened program at
different levels of coarsening. In our first experiment,
we set the temperature to T = 1 and average 10 runs
for both coarse-to-fine filtering (with 90 and 30 levels)
and flat importance sampling. The 90-level coarsening
condition fully reduces the 27 × 27 lattice to a 3 × 3
lattice, and the 30-level condition yields a partially
coarsened matrix. In our second experiment, we set
the temperature to T = 2.39 and run the same set of
conditions. Figures 4a and 4b show the average
importance weight for different levels of coarsening. We
see that even intermediate levels of coarsening perform
better than no coarsening, and that a full coarsening
performs dramatically better than the other conditions.
The best solution from the second experiment is shown
in the right-most panel of Figure 5, along with the two
coarser configurations from which it was refined. This
displays the characteristic local structure of the Ising
model at critical temperatures.


Figure 5: Coarsening at the critical state (T = 2.39)


4.1.2 Stereo matching 


Another common application of MRFs is the stereo
matching task (Boykov et al. 2001; Scharstein and
Szeliski 2002). The goal is to estimate the disparity
between two images, $\mathcal{I}$ and $\mathcal{I}'$, captured from slightly
shifted viewpoints. This disparity map can be used to
recover a rough measure of depth. As in the Ising case,
we implement this task as a probabilistic program by
sampling a lattice of disparity values, and factoring on
its energy.


The energy function for a particular set of disparity
values has two parts: (1) a smoothing term penalizing
distance between the values of neighbors and (2) a data
cost term penalizing each particular disparity value
for discrepancies with the true data (as measured by
comparing the difference in pixel intensities at the given
disparity):


$$\mathcal{H}(d) = \sum_{\{p,q\} \in \mathcal{N}} V_{p,q}(d_p, d_q) + \sum_{p} C(p, d_p)$$

We denote the intensity of pixel p in image $\mathcal{I}$ by $I_p$.
Since corresponding pixels should have similar intensities,
we set our data cost term $C(p, d_p)$ as suggested by
Boykov et al. (2001), taking the absolute difference
between $I_p$ and $I'_{p+d_p}$. To reduce sensitivity to variability
in image sampling, we interpolate between neighboring
intensities in the neighborhood $x \in (d - 0.5, d + 0.5)$
and take the minimum. For our smoothing function,
we use the truncated squared error:

$$V_{p,q}(d_p, d_q) = \min\big((d_p - d_q)^2, V_{\max}\big)$$


with $V_{\max} = 5$.
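
A simplified sketch of this energy follows. It is not the paper's implementation: it omits the sub-pixel interpolation step, assumes that a disparity d shifts a pixel by d columns within the same row, and clamps out-of-range columns to the image border. The names `smoothness`, `data_cost`, and `stereo_energy` are hypothetical.

```python
V_MAX = 5  # truncation threshold for the smoothing term


def smoothness(dp, dq, v_max=V_MAX):
    """Truncated squared difference between neighboring disparities."""
    return min((dp - dq) ** 2, v_max)


def data_cost(left, right, r, c, d):
    """Absolute intensity difference between pixel (r, c) and its match shifted by d.

    Assumption: columns that fall outside the image are clamped to the border.
    """
    shifted = min(max(c + d, 0), len(right[r]) - 1)
    return abs(left[r][c] - right[r][shifted])


def stereo_energy(disparity, left, right, v_max=V_MAX):
    """Sum of smoothing terms over neighbor pairs plus data costs over pixels."""
    rows, cols = len(disparity), len(disparity[0])
    energy = 0
    for r in range(rows):
        for c in range(cols):
            energy += data_cost(left, right, r, c, disparity[r][c])
            if r + 1 < rows:  # neighbor below
                energy += smoothness(disparity[r][c], disparity[r + 1][c], v_max)
            if c + 1 < cols:  # neighbor to the right
                energy += smoothness(disparity[r][c], disparity[r][c + 1], v_max)
    return energy


left = [[10, 20, 30, 40]]
right = [[0, 10, 20, 30]]  # same intensities, displaced by one column

print(stereo_energy([[1, 1, 1, 1]], left, right))  # matching disparity: low energy
print(stereo_energy([[0, 0, 0, 0]], left, right))  # zero disparity: higher energy
```

The truncation keeps a single sharp depth discontinuity from dominating the score, since any jump larger than about two disparity levels costs the same.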

We implemented energy minimization for the stereo
matching model analogously to the Ising model, but with
different coarsening and refinement functions:
coarsening replaces a 2 × 2 sublattice with its mean and
standard deviation; refinement returns the set of all
possible 2 × 2 lattices with the given mean and standard
deviation that could have occurred at the given
abstraction level.
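
The pair of functions described above can be sketched as below. This is a sketch under assumptions, not the paper's code: disparities are taken to be integers in `range(num_disparities)`, the standard deviation is the population standard deviation, and a 2 × 2 block is written as a flat 4-tuple. The names `coarsen_value` and `refine_value` mirror the roles named in the text but their signatures are hypothetical.

```python
from itertools import product
from math import isclose, sqrt


def coarsen_value(block):
    """Map four disparity values to their (mean, population standard deviation)."""
    mean = sum(block) / len(block)
    variance = sum((v - mean) ** 2 for v in block) / len(block)
    return (mean, sqrt(variance))


def refine_value(coarse, num_disparities):
    """Enumerate all 2x2 blocks (as 4-tuples) that coarsen to `coarse`."""
    mean, std = coarse
    matches = []
    for block in product(range(num_disparities), repeat=4):
        m, s = coarsen_value(block)
        if isclose(m, mean, abs_tol=1e-9) and isclose(s, std, abs_tol=1e-9):
            matches.append(block)
    return matches


coarse = coarsen_value((1, 1, 3, 3))
print(coarse)                                   # mean and spread of the block
print(refine_value(coarse, num_disparities=5))  # arrangements with the same summary
```

Keeping the standard deviation alongside the mean lets a coarse value distinguish a flat patch from one that straddles a depth edge, at the price of a larger coarse state space.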


Figure 4c shows that in a comparison between coarse-to-fine
SMC and importance sampling, SMC finds lower-energy
states more efficiently for a 10x40 cropped pair
of images from the Middlebury dataset (Scharstein and
Szeliski, 2002), although we expect that the absolute
quality of the states found is not very good for either
algorithm.


4.2 FACTORIAL HMM 


In our second example, we test the hypothesis that
abstractions are useful as a means to avoid particle
collapse in large state spaces. For this reason, we chose
the Factorial HMM, a model with a large effective
state size even within a single particle filter step. The
Factorial HMM is an HMM where the state factors into
multiple variables (Ghahramani and Jordan 1997). If
there are M possible values for each latent state, and
k state variables per time step, then the effective state
size is $M^k$. If only a few of these have high probability,
then even for moderate M and k it is possible that
there are not sufficiently many fine-grained particles to
cover all regions of high posterior probability.
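
As a worked instance of the effective state size $M^k$, take the setting of the first experiment described next, with M = 256 values per variable and k = 3 variables per time step:

```latex
M^k = 256^3 = \left(2^8\right)^3 = 2^{24} = 16{,}777{,}216
```

Each filtering step therefore ranges over roughly 1.7 × 10^7 joint states, which illustrates why a fixed particle budget can miss regions of high posterior probability.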


In our first experiment, we use a Factorial HMM with 3 
variables per step, 256 possible state values per variable, 
and 6 observed time steps. We run both coarse-to- 
fine filtering (with 6 and 8 abstraction levels) and flat 
filtering and average 10 runs. 


We coarsen the Factorial HMM by merging some state
and observation symbols. To test the hypothesis that
coarse-to-fine inference will work best when abstractions
match the dynamics of the model, we generate
transition and observation matrices with approximately
hierarchical structure as follows. Enumerate the states
from 1 to N. For states i and j, we let the transition probability
be approximately proportional to Similarly, 

for state i, the probability of generating observation k 
is proportional to 


Each state consists of three substates, and each substate
is chosen from {1, 2, ..., 256}. Coarsening maps
numbers to successively larger intervals. For example,
coarsening maps (5, 11, 56) to ([5, 6], [11, 12], [55, 56]),
and on the next-coarser level to ([5, 8], [11, 14], [55, 58]).
Figure 6a shows that plain particle filtering consistently
underestimates the true normalization constant relative
to coarse-to-fine filtering.

In our second experiment (Figure 6b), we compare the
behavior of flat and coarse-to-fine filtering as the number
of HMM states increases from $2^3$ to $256^3$. As before,
we keep the runtime constant. For small numbers of
states, plain and coarse-to-fine SMC give very similar
estimates. As the number of states grows, the difference
between coarse-to-fine and plain filtering grows
as well, indicating that coarse-to-fine is most useful in
large state spaces.


Figure 6: Inference for a Factorial HMM.
(a) SMC using a coarse-to-fine model finds more probable
samples earlier on than SMC without coarsening.
(b) For small numbers of states, coarse-to-fine is
indistinguishable from plain particle filtering. As the number
of states grows, coarse-to-fine is able to provide better
solutions in the same amount of time. (Axis label:
Number of HMM states.)


5 DISCUSSION 

5.1 WHEN DOES COARSE-TO-FINE 
INFERENCE HELP? 

It is generally difficult to compute or estimate the posterior
probability of a set of (program) states. However,
this is precisely what is required for coarse-to-fine inference
to work: when we evaluate a program on a coarse
level, we need to estimate for each coarse value how
likely its refinements are under the posterior. This suggests
that settings where coarse-to-fine inference works
have special characteristics that make such estimation
feasible. We now name a few.

First, the given program may satisfy independence assumptions
that make estimating posterior probabilities
feasible. For example, for the program shown in Figures
[?] and [?], the score function only depends on one of x
and y at a time; hence, we can independently compute
the estimated score for the refinements of x and y, and
use this information in computing the estimated scores
for abstract values of both.

Second, we may be in a setting where the type of
a coarse value matches the type of its refinements.
In that situation, “polymorphic” score and primitive
functions may be a cheap heuristic for estimating the
posterior probability of a coarse state. For the Ising
model, the energy function satisfies this criterion up to
parameterization.


Third, coarse-to-fine may be particularly useful in the
amortized setting (Stuhlmüller et al. 2013). Learning
the conditional distributions associated with lifted
primitive functions is one instantiation of “learning to
do inference”. This is particularly feasible for smooth
state spaces, where one can effectively estimate entire
distributions from a few samples.


Another answer to the question of when coarse-to-fine 
helps is to point out that this depends on what inference 
algorithm is used. For inference by enumeration, exact 
coarsening (i.e., coarsening within values that have 
the same posterior probability) is useful for increasing 
computational efficiency. By contrast, for sequential 
Monte Carlo methods, it is frequently more desirable to 
merge states with different posterior probability, as this 
smoothes the state space and thus increases statistical 
efficiency. 


5.2 CURRENT LIMITATIONS 


The system as presented has three main limitations. 
We will now describe these limitations and outline how 
they can be addressed. 


First, as discussed in Section 3.2, the coarse-to-fine
transform can only be applied to models where all
ERPs are independent. We have described how to
transform models into this form and referred to the
technique of merging sample and factor statements
described in Goodman and Stuhlmüller (2014) as a
tool for recovering statistical efficiency lost in such a
transform. We have not yet employed this merging in
our experiments, which makes the results more difficult
to interpret. We expect that it would be feasible to
develop a version of the coarse-to-fine transform that
directly operates on dependent ERPs.

Second, single-site MCMC rejuvenation steps are of
very limited use in the current setup. This is due
to the combination of (a) computing coarsened values
by binning fine-grained values (using the user-provided
coarsenValue function) and (b) using this
coarsening to build a hierarchical model to which standard
SMC algorithms can be applied. In this hierarchical
model, each fine-grained value v only has non-zero
probability if the corresponding coarse value is set
to coarsenValue(v). As a consequence, we reject all
MCMC steps that change this coarse-grained value to
anything else, and likewise all steps that change v to v'
such that coarsenValue(v) ≠ coarsenValue(v'). These
strong dependencies between levels of abstraction can
be avoided if we only use each level in the sequence
of coarse-to-fine models as an importance sampler for
the next level instead of constructing a coarse-to-fine
model.
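
The rejection behavior can be illustrated with a toy binning. This is a hypothetical sketch, not the system's rejuvenation code: `coarsen_value` here is an assumed width-2 binning over {1, ..., 256}, and `move_has_support` simply checks whether a proposed fine value is still consistent with the coarse value held fixed by a single-site move.

```python
def coarsen_value(v, width=2):
    """Hypothetical binning: map v in {1, 2, ...} to the index of its width-sized bin."""
    return (v - 1) // width


def move_has_support(coarse, proposed_fine, width=2):
    """True if the proposed fine value is consistent with the fixed coarse value."""
    return coarsen_value(proposed_fine, width) == coarse


current_fine = 5
coarse = coarsen_value(current_fine)

# Of all 256 candidate fine values, only those in the same bin can be accepted.
surviving = [v for v in range(1, 257) if move_has_support(coarse, v)]
print(surviving)
```

With this binning, a single-site proposal drawn uniformly over the fine values would keep non-zero probability only when it lands in the current bin, which is the sense in which such moves are of limited use.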


rections, including (1) understanding the relation to 
abstract interpretation, and Galois Connections specifically
(e.g., Cousot and Monerau 2012; Monniaux
2000), (2) automatically deriving coarsenings for
hierarchical Bayesian models, (3) learning good coarsenings,
and efficient learning of approximations for coarsened
primitive and score functions, and (4) coarsening (merging)
multiple variables across blocks, potentially via
flow analysis.


We expect that the most interesting applications of 
coarse-to-fine approaches to efficient inference are yet 
to come. 